Characterizing the Life Cycle of Online News Stories 
Using Social Media Reactions

ABSTRACT

This paper presents a study of the life cycle of news arti- 
cles posted online. We consider user activity both from the 
perspective of their visitation patterns and from their social 
media reactions. We show that we can use this informa- 
tion to characterize distinct classes of articles, and that we 
can use social media reactions to predict future visitation 
patterns early and accurately. We validate our methods us- 
ing qualitative analysis as well as quantitative analysis on 
data from the website of Al Jazeera in English, for a set of 
articles generating more than 3,000,000 visits and 200,000 
social media reactions. We show that it is possible to pre- 
dict the overall traffic an article will receive with the first ten 
minutes of social media reactions; the prediction accuracy is 
equivalent to the one based solely on visits after three hours. 
We also describe significant improvements on the accuracy 
of the prediction of shelf-life for news stories. 

1. INTRODUCTION 

Traditional newspapers have been in decline in recent years 
in terms of readership and revenue; in comparison, digital 
online news has been steadily increasing according to both
metrics. Recent surveys have shown that about half of the
population of the US gets their news online, and about one
third goes online every day for news.

The study of patterns of consumption of online news has
attracted considerable attention from the research community
for over a decade. This research started with the analysis
of access patterns to websites, and has expanded considerably
to include topics such as the generation of personalized
news recommendations, automatic summarization, engagement
metrics, etc. (see Section 2 for an overview).

One line of research looks at consumption and interac- 
tion patterns as a single time series and attempts several 
prediction tasks on it. For example, predicting total comments
from early comments [26, 17], total visits from early
visits [T2], etc. More recent works incorporate attributes
from each specific article (e.g. topic, source, etc.) into the 
prediction [3]. 

In our approach, we observe a multivariate time series 
during a brief period of time, and predict some characteris- 
tic of the rest of the series. Specifically, we focus on a se- 
ries describing user behavior around a news article including 
visits, social media reactions, and search/referrals, which is 
observed in the initial minutes following publication. We at- 
tempt to predict the total number of page views of the article 
and its effective shelf-life. We define the effective shelf-life 
as the time span during which the article will acquire the 
majority of its visits. 

The prediction of future visitation patterns is valuable for 
a news organization, as it allows them (i) to gain a better 
understanding of how people consume different types of news 
online; (ii) to deliver more relevant and engaging content in 
a proactive manner; and (iii) to improve the allocation of 
resources to developing stories over their life cycle. 

Our contributions. In this paper we present a qualitative
and quantitative analysis of the life cycle of online news 
stories. Our main contributions are the following: 

• We characterize two large classes of news stories: break- 
ing news and in-depth articles, and describe the differ- 
ences in users' behavior around them. 

• We describe three classes of short-term response to 
news articles in terms of visits and social media re- 
actions (decreasing, non-decreasing and rebounding). 

• We present predictive models of total visits and shelf- 
life of articles based on short-term users' behavior. 

The remainder of this paper is organized as follows. Section 2
provides an overview of previous works related to ours.
Section 3 introduces our data collection and defines the
concepts and variables we use. Section 4 describes user
behavior with respect to different classes of articles. Section 5
demonstrates the importance of incorporating social media
information into the predictive modeling of visits. The last
section includes some concluding remarks.

2. RELATED WORK 

One of the earliest published studies of user behavior in 
online news was conducted by Aikat [2], who studied the web
sites of two large newspapers from November 1995 to May 
1997. This work describes many of the patterns still seen in 
news sites today: visits occur mostly during weekdays and 
working hours; readers "skim" pages for information so dwell 
times tend to be short, and there are clear traffic "bursts" 
that can be attributed to specific news developments. 

With the advent in recent years of what can be consid- 
ered as new forms of journalism (blogs) and new propagation 
mechanisms for news (micro-blogs and online social network- 
ing sites), the volume of research publications in this area 
has increased considerably. In this section we overview a 
few previous works closely related to ours, but our coverage 
is by no means complete. 

Behavioral-driven article classification. Previous works
that have studied activities around online resources
(e.g. visiting, voting, sharing, etc.), including [7, 18], have
consistently identified broad classes of temporal patterns. 
These classes can be generally characterized, first, by the 
presence or absence of a clear "peak" of activity; and sec- 
ond, by the amount of activity before and after the peak. 

Crane and Sornette [7] describe classes of visitation pat- 
terns to online videos, and present models that are consistent 
with propagation phenomena in social networks. Lehmann 
et al. [18] extend these classes by observing that for Twitter 
"hashtags" (user-defined topics) the distributions of activ- 
ity in different periods (before/during/after) induce distinct 
clusters of activity that can be interpreted considering the 
semantics of each hashtag. Romero et al. [23] describe how 
manually-assigned classes of hashtags are related to differ- 
ent shapes of the exposure curve: the probability that a user 
will propagate some information ("retweet" in the case of 
Twitter) after being exposed to the information by a cer- 
tain number of her neighbors. 

Yang and Leskovec [28] describe six classes of temporal 
shapes of attention. Attention is measured in terms of the 
number of appearances of a given phrase (or a variation of
it) corresponding to an event. The patterns describe the 
distribution of attention over time, as well as the ordering 
in which different types of media (professional blogs, news 
agencies, etc.) will "break" the story. 

In general, previous works have established that the evo- 
lution of the popularity of different on-line items depends 
on their class. Figueiredo et al. [9] describe how YouTube 
videos that are posted to a "top" page on the website, and 
videos that are making use of professionally produced con- 
tent, are different from randomly-chosen videos in terms of 
their visit patterns. 

Recently, researchers at URL shortening service Bit.ly [5]
described how an article's half-life (see definition in
Section 3) is affected by topics, extending a previous observation
that in general there are some topics that are more
time-sensitive than others [11]. For instance, business-related
articles have on average a longer half-life, while articles related
to politics/celebrities/entertainment have an intermediate
one. Sports-related articles have in comparison a shorter
half-life. Previously, Bit.ly researchers [4] have shown that
this half-life is also affected by the social media platform
where the link is first posted (e.g. links on Facebook were
longer-lived than links on Twitter).

We deepen and complement previous works on behavioral- 
driven characterization of online content, by describing the 
life-cycle of online news articles considering their visitation 
patterns as well as their social media reactions. 

Prediction of users' activity. The prediction of the volume
of user behavior with respect to an on-line content item
has attracted a considerable amount of research. This is
attested by a number of papers, some of which are outlined in
Table 1. Another active topic that is closely related, but
different, is that of predicting real-world variables such as sales
or profits using social media signals (e.g. [12] and many
others).

Over the years, the models used to predict user behavior 
in social media have increased in complexity. For instance, 
Bandari et al. [3] and Ruan et al. [24] incorporate into their 
models features extracted from the content of the articles, 
such as topics. Yin et al. [21] describe a model that considers 
that voters are divided into two populations: a group that 
conforms to the majority of votes of other people, and a 
group that does not. 

Myers et al. [22] incorporate into propagation models the
presence of external influences, e.g. traditional media sources 
that can reach vast audiences, such as television networks. 
Huang et al. [TJ consider an online model that evolves over 
time as more information becomes available. 

In contrast with previous works, we focus on the dynamic 
relation between social media reactions and visits over time, 
and show that both are useful to understand the differences 
among classes of articles and to predict future visit patterns. 



Analysis of news visits and social media responses. 

Dezso et al. [8] analyze the visits to a large news portal in 
Hungary. One aspect they study which is closely related to 
our work is the half-life of articles, which is shown to be 
distributed according to a power-law across a broad range, 
with a mean of 36 hours. Agarwal et al. [1] study the actions
users perform after reading an article, which include
printing, commenting, rating, and sharing through e-mail
or social media. Their focus is on performing personalized
recommendations, but they also uncover that article topics
have an effect on the probability of each action, with a division
between articles users read privately and articles they
share publicly: "Users tend to share articles that earn them
social prestige and credit but they do not mind clicking and
reading some salacious news occasionally in private." [1]

Social media reactions to traditional news media can vary 
not only in volume but also qualitatively. Hu et al. [TH] 
record tweets during the broadcast of a speech of the US 
President. They observe that many tweets refer to the speech 
in general, except for certain topics which are discussed in 
more detail. 

Finally, social media optimization company SocialFlow 
describes in a whitepaper [21] a comparative study of social 
media responses to several large media outlets: Al Jazeera, 
BBC News, CNN, The Economist, Fox News and The New 
York Times. Among other findings, they note that the
probability that a user clicks on a tweet is higher for The
Economist (~19%) than for Fox News (~16%), Al Jazeera
(~11%) or The New York Times (~4%). However, followers
of Al Jazeera are almost twice as likely to retweet article
links as followers of the other channels.

In contrast with previous works, we consider jointly traffic 
to the website and social media reactions, as both constitute 
acts in which users engage with the news content. Addition- 
ally, we quantify the richness of Twitter messages over time 
measuring entropy and counting unique tweets, and show 
that these variables are key to more accurate predictions of 
future visits. 

3. CONTEXT AND DATASET 

In this section we provide some context to our research 
and describe the dataset that will be used in the remainder
of the paper. 

3.1 Traditional news and social media 

Our dataset is provided by Al Jazeera English, a well-established
news organization that reaches hundreds of millions
of viewers through its TV channel. The Al Jazeera
English website is divided into five major sections: News,
In-Depth, Programmes, Sport, and Weather, plus a collection
of blogs, which is outside the scope of this study.
Approximately 40 editors/producers work on the areas of
News, In-Depth, and Programmes.

As has become standard practice in recent years for all
major media organizations, the editors of Al Jazeera maintain
Facebook and Twitter accounts (we call them "corporate
accounts" in the rest of the paper) and use them
actively to announce their content. Each account
(facebook.com/aljazeera and @AJEnglish) has over one million
followers as of November 2012. Using these accounts, articles
in the News section are shared immediately after being
posted online. Articles in the In-Depth and Programmes
sections are shared throughout the day with the goal of
maximizing audience reach across multiple time zones.

The social media accounts re-share articles at different 
times of the day, sometimes up to 4 times, on a schedule 
determined by editors' judgment and designed to increase 
user engagement. Close attention is paid to the wording of 
the items posted in social media, including aspects such as 
their length and the use of hashtags in the case of Twitter. 
Editors use a variety of online tools to obtain low-latency 
analytics of traffic and social media, and to decide which 
hashtags and keywords to use in their postings. 

The visitors to the Al Jazeera website also use social media 
platforms. According to an online survey taken in 2011 that 
received about 4,500 responses, 18% of respondents said they 
used Twitter, and 42% that they used Facebook (12% said 
they used both). 

Social media interactions and traffic to the website can 
complement or substitute each other. Most frequently, they 
complement each other: people click on the shared content 
and visit the website. Sometimes, the social media share can 
be a substitute for a visit to the article, such as when a video 
can be viewed directly on the social media site, or when the 
social media content itself delivers enough information to 
satisfy users without requiring them to click through to the 
full article. 

For instance, the news item "Pakistan's Malala now able to
stand in UK" (19 Oct 2012) generated an unusually large
number of shares on Facebook, but comparatively little traffic
on the website. At the time, the student-activist was
being treated for nearly fatal wounds received ten days
before, and it is likely that users who were following the
story just wanted to express their relief or satisfaction at
her recovery.

In summary, for Al Jazeera and for most large news organizations,
social media is important both because it attracts
more visitors to their website than any other external referrer,
and because it provides more platforms on which to
reach an audience. Hence, many news organizations adopt an
active role in social media in order to increase this positive
effect.

3.2 Data collection 

We focus on a period of three weeks between October 8th, 
2012, and October 29th, 2012. The choice of this period is 
not random: it was a relatively stable period of traffic, only 
exhibiting a relatively minor peak on October 29th due to 
Hurricane Sandy. Figure 1 depicts the frequency of visits to
all the articles in our dataset during the observation period. 

The data collection is done via a "beacon" embedded in
all article pages: this produces events that are processed using
Apache S4, a high-performance system for online processing,
which is used to collect and aggregate the visits
with a 1-minute granularity. For efficiency reasons, only articles
obtaining at least 5 visits in a 10-hour window are
monitored. The collected data is stored using a Cassandra
NoSQL database.

http://incubator.apache.org/s4/
http://cassandra.apache.org/

Figure 1: Frequency of visits during the observation
period. We selected those for which the first
observation occurred between Oct. 8th-29th, 2012.
Table 2: [...]cs of our dataset.

                                  Total    Article avg.
Number of articles                606      -
Visits after 1 hour               260 K    430
Visits after 1 day                2.5 M    4,273
Visits after 7 days               3.6 M    5,971
Facebook shares                   155 K    256
Tweets                            80 K     133
Tweet entropy                     -        5.6 bits
Fraction of unique tweets         -        19.9 %
Fraction of corporate retweets    -        36.8 %



Our system also collects messages from Facebook and Twit- 
ter. Both platforms have strict limitations on polling frequencies,
which impose a trade-off between the number of
articles we can monitor and the frequency with which we 
monitor them. To obtain more accurate results for popu- 
lar articles, and after experimenting with different settings, 
we decided to poll social media reactions for articles that 
are within the list of the 30 most visited articles during each 
five-minute data collection window. We remark that this list 
varies considerably over time, and is larger than the number 
of new articles published every day. 
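As an illustration of the polling policy just described, the sketch below selects, for one five-minute window, the articles whose social media reactions would be polled. The function name `select_articles_to_poll` and the `(article_id, timestamp)` event format are assumptions made for this example; they are not details of the system used in the study.

```python
from collections import Counter

WINDOW_SECONDS = 5 * 60   # five-minute data collection window (from the text)
TOP_K = 30                # number of most visited articles polled (from the text)


def select_articles_to_poll(visit_events, window_start, top_k=TOP_K):
    """Return the ids of the top_k most visited articles in one window.

    visit_events: iterable of (article_id, unix_timestamp) pairs
                  (hypothetical event format).
    window_start: unix timestamp at which the window begins.
    """
    window_end = window_start + WINDOW_SECONDS
    counts = Counter(
        article_id
        for article_id, ts in visit_events
        if window_start <= ts < window_end
    )
    # most_common returns (id, count) pairs ordered by decreasing count.
    return [article_id for article_id, _ in counts.most_common(top_k)]
```

Because the list is recomputed for every window, an article can enter and leave the polled set several times during its life, which is consistent with the remark that the list varies considerably over time.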

We selected a uniform random sample of articles whose 
first visit was recorded during the observation period, and 
kept only those accumulating at least 100 visits during their 
first week after publication. A total of 606 articles was in- 
cluded; this covers over 3.6 million visits and at least 235,000 
social media reactions. Table 2 presents some summary
statistics on this dataset. 



3.3 Metrics

For each article we collected a number of metrics regarding
user visits and social media reactions. First, we observed at
a granularity of one minute the number of visits (page views)
to each article, and the identity of the referral, or previous
page seen on each visit. We bucketed the latter into four
classes:

• Internal links, mostly from the home page of the website:
these are the majority of the traffic sources and
comprise 70% of the visits;

• External links from other sources including social media
sites, news aggregators, and others: 14%;

• Direct links, which have an empty referral and correspond
mostly to people sharing news through instant
messaging, e-mail, or other non-web applications: 11%;

• Search referrals, basically links from organic search
results: 5%.

We remark that these figures correspond to the articles
in our sample, which do not include the homepage of the
website, section index pages, or older articles. If we take
those into account, the numbers are different, e.g. the search
referrals account for 30% of the visits.

The fraction of "direct links" in the case of articles is often
a form of social media reaction, as most of it corresponds to
sharing by e-mail or instant messaging. We also collected
periodically the number of times an article has been shared
on Facebook, and the content of any Twitter message containing
the URL of the article, or a variant of the URL
produced by a URL shortening service. We used this data
to compute the following variables:

• Number of Facebook shares per minute (interpolated).

• Number of tweets per minute.

• Number of unique tweets per minute. A tweet is deemed
unique if its edit distance with all previous tweets pointing
to the same article (after discarding shortened URLs
and "retweet" prefixes) is more than 10 characters.

• Tweet vocabulary entropy. To compute this, at any
given point in time we create a document by concatenating
all the tweets received up to that time. Then,
we compute the entropy of the distribution of terms in
that document.

• Number of corporate retweets per minute. A tweet is
a "corporate retweet" if it includes "RT @AJEnglish"
or "RT @AJELive" in its text. A tweet can be both
corporate retweet and unique, as users are free to edit
the retweet before posting it.

• Number of followers, friends (followees) and statuses
of each of the users posting a tweet.

4. BEHAVIORAL-DRIVEN CLASSES

In this section we describe classes of articles according to
patterns of user behavior.

4.1 News vs In-Depth

There are two large classes of articles that trigger distinct
user behavior patterns: News (322 articles in our sample)
and In-Depth (139 articles). We depict in Figure 2 the
most frequent terms in the titles of these two classes (as in
e.g. m).
Figure 2: Most frequent words in article titles in the
"News" (left) and "In-Depth" (right) sections.

News articles are dominated by current events, such as
the conflict in Syria for this period of time, while In-Depth
articles are dominated by photos and analyses of some topics.

Figure 3: Top: typical profile for a News item: "Pirates
abduct ship's crew off Nigerian coast" (Oct
17th, 2012). Bottom: an In-Depth item: "Teenage
rights activist shot in Pakistan" (Oct 9th, 2012)

Figure 3 (top) depicts the typical series for some of these
variables in a News article. Time is expressed in hours-equivalent,
which are hours corrected by the seasonality (day-night,
weekday-weekend) of traffic on the website, as in [25].
We can see that initially there is a burst of visits and of
activity on Twitter and Facebook, which decays rapidly after
a short time. This is often the pattern in news media, as
observed e.g. by [8, 20] and others. After three or four hours,
almost all the visits can be explained by "internal traffic",
i.e. visitors arriving from the homepage of the site. When
the news article is displaced from the homepage by more
recent items, after about nine hours in this example, its traffic
slows down considerably.

The profile of In-Depth articles can be much more complex.
Figure 3 (bottom) depicts one such example, where a
sustained level of attention on Facebook is observed during
several hours.

News items compared to In-Depth items have a more
intense first hour, as can be seen in Figure 4. The two
groups look similar to "promoted" (homepage) and
"not-promoted" stories in Digg as observed in [25]. News articles
are indeed displayed more prominently on the homepage of
the website, with the most salient location being typically
used by a news item; however, In-Depth articles are also
visible across the website, including a prominent slot on the
top right corner of every page.

Figure 4: Visits in the first hour versus visits on the
first week for articles in the two largest categories.

Figure 5: Differences in the distribution of Facebook
vs Twitter shares. On average the ratio of Facebook
shares to tweets is 1.9:1 (1.6:1 for News, 2.7:1 for
In-Depth).

On average the ratio of Facebook shares to tweets per article
is 1.9:1, which is to some extent consistent with the survey
described in Section 3.1, which indicated that there were
twice as many website visitors using Facebook as there were
Twitter users. Additionally, In-Depth articles are shared
more on Facebook given the same level of activity on Twitter,
as shown in Figure 5. On average News articles have 1.6
Facebook shares per tweet, while In-Depth articles have 2.7.

As shown in Figure 6, there is also a difference in the number
of unique tweets. On average, 17% of the tweets about
News articles are unique, versus 25% of the tweets about
In-Depth articles. This means that a majority of users do
not change the content of the tweets when clicking on the
"tweet" button next to the articles, or when retweeting from
another Twitter user.

Figure 6: Differences in the distribution of the fraction
of unique tweets.

There is also a difference in the number of corporate retweets,
as shown in Figure 7. On average, 27% of tweets about News
articles are corporate retweets, compared to 44% of tweets
about In-Depth articles. This means that for In-Depth articles
a larger share of Twitter activity can be attributed to
users who are followers of @AJEnglish or @AJLive, and thus
are probably more engaged with these Twitter accounts.

Anecdotally, we know that editors spend more time craft- 
ing tweets to promote In-Depth articles than News articles, 
given that the former are not as time sensitive as the latter. 
In the case of News, the headline is often posted without 
modifications to Twitter, which may produce a compara- 
tively less appealing tweet. 
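The tweet-level variables compared in this section (unique tweets, vocabulary entropy and corporate retweets, as defined in Section 3.3) can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used in the study: the regular expressions used to strip URLs and retweet prefixes, the whitespace tokenization, and all function names are ours. Entropy is computed in bits, matching the unit reported in the dataset statistics.

```python
import math
import re
from collections import Counter

CORPORATE_MARKERS = ("RT @AJEnglish", "RT @AJELive")  # markers from Section 3.3
MIN_EDIT_DISTANCE = 10  # unique if farther than this from every earlier tweet


def normalize(tweet):
    """Drop URLs and leading "RT @user:" prefixes (assumed normalization)."""
    text = re.sub(r"https?://\S+", "", tweet)
    text = re.sub(r"^(RT @\w+:?\s*)+", "", text.strip())
    return " ".join(text.split())


def edit_distance(a, b):
    """Levenshtein distance, computed one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def count_unique(tweets):
    """Count tweets that differ enough from all earlier tweets on the article."""
    seen, unique = [], 0
    for tweet in tweets:
        text = normalize(tweet)
        if all(edit_distance(text, old) > MIN_EDIT_DISTANCE for old in seen):
            unique += 1
        seen.append(text)
    return unique


def vocabulary_entropy(tweets):
    """Entropy, in bits, of the term distribution of all tweets seen so far."""
    terms = " ".join(tweets).lower().split()  # whitespace tokens (assumption)
    total = len(terms)
    if total == 0:
        return 0.0
    counts = Counter(terms)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def is_corporate_retweet(tweet):
    """True if the tweet text contains one of the corporate retweet markers."""
    return any(marker in tweet for marker in CORPORATE_MARKERS)
```

Under these definitions, an unedited retweet of a corporate tweet counts as a corporate retweet but not as a unique tweet, while an edited one can count as both.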



4.2 Qualitative analysis of news articles 

We begin by examining the attention profiles of articles 
(e.g. Figure 3) during their first 12 hours-equivalent after
publication, and categorize them into several classes based 
on common visit patterns. We then proceed to provide a 
qualitative description of each class and the articles that 
comprise it. This analysis is limited to News articles, as In- 
Depth and other article categories have significantly more 
variability in the shape of their traffic profiles, making gen- 
eralization difficult. 

At a high level, the classes of articles in our News sample 
can be roughly described by an "80:10:10 rule". The traffic 
to ~80% of the articles decreases monotonically during the 
first 12 hours, the traffic to ~10% does not decrease, and 
the traffic to the remaining ~10% decreases first, but then 
rebounds. Some example articles are listed in the appendix.

Decreasing (~78%). 

The largest article class represents about 78% of the sam- 
ple set. Articles in this class demonstrate an initial spike 
in visits following article publication, followed by a rather 
consistent drop in the number of visits, either immediately 
(244 articles), or after a short delay (7 articles). 

Delayed onset traffic decreases have been observed be- 
fore, such as in [20] with respect to the shooting in Au- 
rora, Colorado, in 2012. This attention pattern can often 
be attributed to breaking news that resonates with readers 
located in a time zone that is off-peak when the article is 
first posted, such as when that portion of the audience is 
mostly asleep. A story about Hurricane Sandy's movement 
up the east coast of the United States, for example, sees an 
initially sharp visit growth that begins to decline as the east 
coast retires for the evening. 

The predominance of this class of article indicates that
while news itself occurs, and can even be covered, at a constant
rate, in most cases readers will only be interested in a
news article for a brief period of time after its publication.

Steady or Increasing (~12%).

Roughly 9% of the sample's articles retain relatively con- 
stant visitor rates during their first 12 hours. Compared 
to news categories with very short shelf-lives, such as sports 
news, these articles are remarkably consistent. In this subset 
of news articles, dramatic news and emotional stories appear 
to garner Facebook shares and, often as a result, extended 
shelf-lives. 

In the U.S., multiple articles on Obama and Romney's 
sharp-tongued presidential debate drive consistent Facebook 
and Twitter responses for a relatively long period of time fol- 
lowing the articles' publication. A poll on racism in the US 
has similar staying power and Facebook sharing. In Cen- 
tral Asia, the Taliban attack on Pakistani schoolgirl Malala 
appears in a number of these articles, where consistent Face- 
book sharing buoys the article traffic beyond average shelf- 
life. In Europe, furor over a seismology scandal is posted 
to Facebook, while in the Middle East, atrocities in the war 
in Syria and violence between Israel and Hamas also gen- 
erate hours of steady traffic. Africa sees a new prime min- 
ister in Libya, the police shooting of 34 striking miners in 
South Africa, and a bomb attack on a church in Nigeria, all 
of which see sustained traffic thanks in part to significant 
Facebook and Twitter sharing many hours after their initial 
publication. 



Stories in this group were mostly developing stories and 
many of them had regular updates. One such example is the 
story about Malala for which Al Jazeera sent a correspon- 
dent to the Swat Valley. Being a complex region to cover, 
a series of news articles and feature stories were written. 
In addition, Al Jazeera reached out to the Reddit community
for a Q&A session which topped the "Ask me anything
(AmA)" section.

A relatively small number of articles (3% of our sample) 
buck the usual trend and see increased page traffic as time
passes after their publication, rather than a decline. To the 
extent that these articles can be generalized, they resemble 
the class of articles detailed above. Some of these articles 
were also updated with supporting content. For at least half 
of them, web producers added video packages after publica- 
tion, which may explain to some extent the increase in visits. 

Rebounding (~10%). 

About 10% of the articles in our sample initially exhibit 
the expected decline in visits per minute, until a point where 
such decline is reversed. This "rebound" occurs because
of either internal or external links.

In the case of internal traffic, the traffic patterns behind 
these rebounding articles sometimes reflect the common news- 
room practice of linking to previous coverage in more recent 
articles. This practice provides additional background con- 
text to readers just arriving at the story, but also helps news 
organizations extract additional value from articles that are 
otherwise statistically becoming valueless. Stories that re- 
quired a significant investment of resources to produce are 
also promoted more heavily than regular articles. We can 
see that in these cases, these internal links do indeed de- 
liver readers to articles whose shelf-lives have nearly expired, 
when measured by homepage and social media traffic. 

The articles that rebound as a result of external traffic 
are beneficiaries of attention directed from outside of the 
news organization (e.g. a social networking site, the web- 
site of another news network, etc.). Typically each observed 
burst in external web traffic can be tracked to a single source. 
Breaking stories can also gain visits as ongoing developments 
drive significant additional interest. This phenomenon is ev- 
idenced, for instance, by three rebounding articles tracking 
Hurricane Sandy's descent upon the United States. 

In general, we see that when News articles cover topics 
that stray from "hard news", the article's attention profile 
reflects the increased variability seen in the In-Depth pieces. 
For example, some articles ostensibly cover specific actualities,
but also bridge into long-standing issues. In the U.S.,
"Immigrant family in pursuit of the American Dream" and 
"Living the modern American Dream" stoke passions around 
immigration. These articles demonstrate more varied fluc- 
tuations in visits over time. 

The sometimes blurry line between reporting on imme- 
diate actualities and longer-term trends like immigration 
is an area of tension in journalism, one identified by Galtung
and Ruge when they asked "how do 'events' become
'news'?" [10].



5. TRAFFIC PREDICTION USING SOCIAL 
MEDIA DATA 

An increased amount of social media reactions is often 
correlated with more traffic to online articles. This is partic- 
ularly marked in the case of non-decreasing and rebounding 
News articles, as well as In-Depth articles whose visitation 
patterns are more varied and less predictable than regular 
(decreasing) News articles. In this section, we combine so- 
cial media reactions with early visitation measures to pro- 
vide improved predictions of (i) the volume of visits to an 
article after 7 days from its publication and (ii) the effec- 
tive shelf-life of articles, i.e. the time during which they will 
receive most of their visits. 

5.1 Modeling visiting volume 

Our first goal is to determine to what extent social media 
reactions can improve the prediction of the overall popular- 
ity (total number of visits) of an article. The dependent 
variable that we want to describe with our models is the 
total number of visits after 7 days (v7d). We use a straightforward
approach to answer this question: linear regression
models. We include the following variables (described in 
Section FS.Sp as observed at the time at which the prediction 
is performed: number of visits (v), number of visits from link
referrals and from "direct" traffic (vr, vd), shares on Facebook
(f), Twitter (t), mean number of followers of people
sharing on Twitter (foll), entropy of tweets (ent), number
and fraction of unique tweets (uni, unip) and fraction of
corporate retweets (corp). We compare a baseline that uses
only the visits observed so far with a linear regression model
that includes all first-order effects as well as second-order
interactions:

lm(vis7d ~ v)

lm(vis7d ~ (v + vr + vd + f + t + foll + ent + uni + unip + corp)^2).
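In R's formula notation, a model with all first-order effects and all second-order interactions is typically written by raising the sum of predictors to the power 2. The following Python sketch only enumerates the terms such a model would contain for the ten predictors listed above; the helper name `expand_second_order` is ours and is not part of the original analysis.

```python
from itertools import combinations

# Predictor names as listed in the text above.
PREDICTORS = ["v", "vr", "vd", "f", "t", "foll", "ent", "uni", "unip", "corp"]


def expand_second_order(names):
    """Return the main effects followed by all pairwise interaction terms."""
    main = list(names)
    pairs = [f"{a}:{b}" for a, b in combinations(names, 2)]
    return main + pairs


terms = expand_second_order(PREDICTORS)
print(len(terms))   # 10 main effects plus 45 pairwise interactions
print(terms[:12])
```

With ten predictors the full model therefore has 55 terms besides the intercept, which is one reason the variables are later pruned by stepwise selection.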

The distribution of visits to articles is log-normal in our
data, consistent with previous works [27] [3]. We
log-transform, as log(x+1), the visits as well as the
volume of social media reactions. For t = 5, 10, 15, ... we
calculate the proportion of the explained variance of these 
two linear models. The result is shown in Figure [8].

Figure 8: Proportion of explained variance (r^2) for
the prediction of total volume of visits, for News and
In-Depth articles.

Variable                       In-depth          News
Facebook shares                0.0349  +         0.0204  +
Twitter tweets                 0.0026  **        0.0000  ***
Twitter entropy                0.0000  ***       0.0003  ***
Twitter avg. followers         0.0000  ***       0.7898
Volume of unique tweets        0.1086            0.0000  ***
Unique tweets %                0.2337            0.0000  ***
Corporate retweets %           0.0092  **        0.1551
Traffic from external links %  0.2292            0.1615
Traffic from e-mail/IM %       0.3892            0.8656

Table 3: Significance levels for selected regression models.

It takes about 3 hours to be able to explain more than 0.6 of the
variance for In-Depth articles, and the additional variables
improve the prediction from the first minutes. After 10-20 minutes
we observe the largest difference between our regression models
(+0.5 in terms of r^2).
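As a minimal illustration of the quantity being compared, the sketch below computes the proportion of explained variance for the visits-only baseline after the log(x+1) transform. It is a standard-library sketch, not the authors' code; the function names and the numbers are made up for illustration and do not come from the data set.

```python
import math


def log1p_all(values):
    """Apply the log(x + 1) transform to every value."""
    return [math.log1p(x) for x in values]


def r_squared_simple(x, y):
    """Explained variance of the least-squares line y ~ a + b * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    syy = sum((yi - mean_y) ** 2 for yi in y)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    rss = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    return 1.0 - rss / syy


# Hypothetical counts: visits seen early on and total visits after 7 days.
early_visits = [3, 10, 25, 40, 120, 300]
visits_7d = [150, 420, 900, 2100, 4000, 15000]

r2 = r_squared_simple(log1p_all(early_visits), log1p_all(visits_7d))
print(round(r2, 3))
```

Repeating this computation at each observation time, once for the baseline and once for the model with social media variables, would give the two curves being compared.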

We take a closer look at the model variables after 20 min- 
utes to identify the sources of this improvement. For this 
purpose we fit the model variables stepwise by AIC (Akaike
information criterion), as implemented in the step function
of the stats package in R. Table [3] shows how reliably the
social media variables serve as predictors of the volume of
visits after 7 days.
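To illustrate how a single step of AIC-based selection chooses among candidate models, the sketch below scores least-squares fits with the usual Gaussian form of the criterion, n ln(RSS/n) + 2k, which equals the AIC up to an additive constant. The function names and the residual sums of squares are hypothetical; this is not a reimplementation of R's step.

```python
import math


def aic_gaussian(rss, n, k):
    """AIC of a least-squares fit, up to an additive constant.

    rss: residual sum of squares, n: number of observations,
    k: number of estimated coefficients (including the intercept).
    """
    return n * math.log(rss / n) + 2 * k


def best_by_aic(candidates, n):
    """Return the label of the candidate model with the lowest AIC.

    candidates maps a model label to a (rss, k) pair.
    """
    def score(label):
        rss, k = candidates[label]
        return aic_gaussian(rss, n, k)

    return min(candidates, key=score)


# Hypothetical fits for one elimination step over 200 articles.
n_articles = 200
candidates = {
    "full model": (50.0, 11),
    "drop foll": (50.2, 10),
    "drop ent": (58.0, 10),
}
chosen = best_by_aic(candidates, n_articles)
print(chosen)
```

Dropping a variable is preferred only when the increase in residual error is smaller than the penalty saved by estimating one coefficient fewer.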

The fraction of traffic from different sources does not ap- 
pear to be significant when all variables are used for the 
model; when we reduce the model to exclusively these two 
variables, the traffic from e-mail/IM is a more significant 
predictor than the traffic from external links. 

Social media variables seem to be significant, particularly 
the number of tweets and the entropy of the vocabulary 
used in them, as they show high significance both for In- 
Depth and News articles. The number of followers of people 
posting an article on Twitter together with the fraction of 
corporate retweets seem to be particularly important for In- 
Depth articles. A possible interpretation is that the response 
to these articles has a larger component driven by influential
accounts and the actions of Al Jazeera editors. In contrast, 
the number and fraction of unique tweets seem to be signif- 
icant for the prediction of traffic to News articles. A possi- 
ble interpretation is that a rich online discussion around a 
breaking news story can be a significant signal of high user 
interest. 

5.2 Modeling shelf-life 

We define the effective shelf-life at ℓ of an article as the
time passed between its first visit and the time at which it
has received a fraction ℓ of the visits it will ever receive. In
this work we set ℓ = 0.90, but similar values (e.g. 0.85, or
0.95) yield similar results to the ones presented here. When
ℓ = 0.50 this is equivalent to half-life [4] [5].
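The definition above can be sketched directly from a list of visit timestamps. This is a minimal sketch under our own assumptions: the function name `effective_shelf_life`, the representation of visits as hours since publication, and the example values are illustrative and not taken from the data set.

```python
import math


def effective_shelf_life(visit_times, fraction=0.90):
    """Time between the first visit and the visit at which `fraction`
    of all observed visits has been reached.

    visit_times: visit timestamps, e.g. in hours since publication.
    """
    times = sorted(visit_times)
    if not times:
        raise ValueError("need at least one visit")
    # Smallest number of visits covering at least `fraction` of the total;
    # the small epsilon guards against floating-point overshoot.
    needed = max(1, math.ceil(fraction * len(times) - 1e-9))
    return times[needed - 1] - times[0]


# Hypothetical article with ten visits (hours since publication).
visits = [0, 1, 2, 3, 4, 6, 9, 15, 30, 70]
shelf_life = effective_shelf_life(visits)        # fraction 0.90
half_life = effective_shelf_life(visits, 0.50)   # fraction 0.50
print(shelf_life, half_life)
```

In this toy example a single late visit at 70 hours does not affect the shelf-life at 0.90, which is the reason for using a fraction below 1.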

Given that our observation period is finite, we use a seven- 
day observation period as a proxy for the total number of 
visits the articles will ever receive, as for basically all the ar- 
ticles in our sample, there is little activity after 3 or 4 days. 
This is consistent with the experience of Al Jazeera English 
editors and with observations in previous works (e.g. [28]). 
We remark however that there are rare cases where an ar- 
ticle is "re-born" after weeks, for instance when it provides 
background information for a new development. 

As observed in the qualitative analysis, the average shelf-life
of In-Depth articles, 2 days and 9 hours, is longer than
that of News articles, 1 day and 16 hours. Their average
half-lives are respectively 20 hours and 8 hours (both
are shorter than the 36 hours observed in 2006 by Dezso et
al.).

The distribution of the shelf-life for both classes is depicted
in Figure [9]. We also observe that the shelf-life of all
articles seems to be in general independent of their total
number of visits (Pearson's correlation r = -0.03). Consequently,
we expect lower accuracy when predicting shelf-life based
solely on visits.

For the predictive task the linear regression model setup is
analogous to the one used to predict visiting volume; we use
the same variables plus one variable containing the prediction
(output) of the visit-volume model already built. As a performance
metric, we use the average absolute prediction error.
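For concreteness, the performance metric can be sketched as follows. The function name and the shelf-life values (in hours) are hypothetical and serve only to show the computation.

```python
def mean_absolute_error(actual, predicted):
    """Average of |actual - predicted| over all articles."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical shelf-lives in hours for four articles.
actual_hours = [40.0, 57.0, 12.0, 30.0]
predicted_hours = [28.0, 60.0, 25.0, 44.0]

mae = mean_absolute_error(actual_hours, predicted_hours)
print(mae)
```

An error expressed in hours is directly comparable with the shelf-life values reported above, which makes the metric easy to interpret.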

The prediction for News after two hours is within 15 hours 
of the actual value for the model that uses social media 
signals (down from about 18 hours). For In-Depth articles, 
there is a larger improvement, going to about 14 hours from 
18 hours. The gains from using a longer observation period 
are not as significant as in the case of predicting the volume 
of visits, as shown in Figure [10].

Figure 10: Average error of shelf-life predictions.



6. CONCLUSIONS 

Results. We have tracked visits and social media reactions 
to articles on a top news website over several weeks, and 
observed that there are two classes of articles that gener- 
ate qualitatively and quantitatively different responses from 
readers. News articles describing breaking news events tend 
to decay in attention shortly after they are published and 
thus have a shorter shelf-life. In-Depth items portraying or 
analyzing a topic tend to exhibit a longer shelf-life and a 
richer social media response, including more shares on Face- 
book for the same level of tweets, and more content-rich 
tweets in terms of vocabulary entropy and fraction of unique 
tweets. 

By going deeper into the first few hours after publica- 
tion of News articles, we found three distinctive response 
patterns in a roughly 80:10:10 proportion: decreasing traf- 
fic, steady or increasing traffic, and rebounding traffic. We 
found that there can be multiple causes for non-decreasing 
traffic, including the addition of new content to articles, so- 
cial media reactions, and other types of referrals. 

Using the signals collected during our study, and starting 
from predictions based on early visitation patterns, we have 
shown that we can improve by a large margin the accuracy 
of predictions of future visits, and improve the accuracy of 
predictions of article shelf-life. In particular for In-Depth 
articles which exhibit more complex visit patterns over time, 
we have found that incorporating social media activities can 
lead to substantial gains in terms of prediction accuracy. 

Practical significance. From the perspective of a news 
provider, the ability to predict the life cycle of stories has 
three main benefits: 

• In the case of News stories, knowing how the audi- 
ence is interacting with an article is not just "nice to 
have", but increasingly a critical component in deliv- 
ering timely and relevant content to an ever growing 
online audience. 

• For In-Depth stories, which operate on a slower news 
cycle, knowing when to allocate additional time and 
resources can significantly improve the news planning 
process. This is particularly useful for an emerging 
class of shows that combine live online discussions with 
more traditional TV coverage. 

• To a web producer, an article with a longer shelf-life 
means judicious time can be spent preparing back- 
grounder pieces which are valuable in providing con- 
text to a story. From a reach perspective, articles 
with steady or increasing levels of traffic translate into 
higher user engagement. 

Our work depends on having access to a large repository of 
social media reactions. As more people get into social media 
(e.g. Twitter), this line of work will become more relevant 
and will be able to produce even higher quality predictions. 

Future work. We combine findings from computer sci- 
ence, journalism, and media studies. The research presented 
here is more difficult to execute than the traditional
single-discipline study, but we expect interdisciplinary work in
this area to become increasingly common as our media and
technology continue to converge.



In this work, we used linear models and did not attempt 
anything more sophisticated. We do not claim that our models
are the most accurate that can be built using this data,
but used them to demonstrate the importance of social media
signals for the predictive tasks we undertake. Better
models are definitely possible.

We also used a data-driven approach in which shelf-life 
is derived from observations. Alternatively, shelf-life can be 
derived by fitting a visitation curve produced by a parametrized
model [27]. This may yield an improvement in the
prediction accuracy. 

Reproducibility. The data sample used for this study, 
including feature vectors, will be made available for research 
purposes at publication time. A web interface to explore the 
data being collected is available online.

APPENDIX

A. EXAMPLE ARTICLES 

Delayed decreasing: 

Hurricane Sandy moves up US Atlantic coast - Americas
Skydiver lands safely after record jump - Americas 
Third-party candidates spar in US debate - Americas 



Arrests by French police foiled 'bomb plot' - Europe 
Scotland's independence referendum signed - Europe 
Rival protesters clash in Egypt's capital - Middle East 
Syria opposition 'captures' Assad soldiers - Middle East 

Steady: 

Bomb attack hits northern Nigerian church - Africa 

Libya assembly elects new prime minister - Africa 

Police admit 'overreacting' at Marikana - Africa 

Marking the Cuban missile crisis - Americas 

Obama and Romney face off in final debate - Americas 

Obama and Romney meet in combative debate - Americas 

Poll finds fresh increase in US racism - Americas 

US exports to Iran soar despite sanctions - Americas 

Asad Hashim: Ask Me Anything on Malala - Central & South Asia 

Clerics declare Malala shooting 'un-Islamic' - Central & South Asia

India suspends Kingfisher licence - Central & South Asia 

Pakistani schoolgirl Malala arrives to UK - Central & South Asia 

Profile: Malala Yousafzai - Central & South Asia

Teenage rights activist shot in Pakistan - Central & South Asia 

Italian seismologists could face jail term - Europe 

Karadzic to begin Srebrenica defence at Hague - Europe 

Russia says fighters killed in North Caucasus - Europe 

Scientists found guilty in Italy quake trial - Europe 

Bomb blast hits Damascus' Old City - Middle East 

Fatah claims victory in West Bank poll - Middle East 

Fighting dims hopes for Syria Eid truce - Middle East 

Hariri calls on Lebanese to attend funeral - Middle East 

Israel strikes Gaza after Hamas retaliation - Middle East 

Marginalisation of disabled people in Egypt - Middle East 

Palestinians vote in municipal elections - Middle East 

Rights group says Syria used cluster bombs - Middle East 

Syrian children killed in Idlib air raids - Middle East 

US and EU urge political stability in Lebanon - Middle East 



Increasing: 



Colombia and FARC rebels launch negotiations - Americas 
Immigrant family in pursuit of American Dream - Americas 
Living the modern 'American Dream' - Americas 
Man charged over attempted US bank bomb plot - Americas 
Minors flee Central American violence - Americas 
Anti-austerity protests erupt in Athens - Europe 
Lithuanians vote out austerity government - Europe 
Scientists await verdict in Italy quake trial - Europe 
Assault on Yemen base blamed on al-Qaeda - Middle East 
Qatari emir in historic Gaza visit - Middle East 



Rebounding: 



African and EU leaders to hold Mali summit - Africa

Evidence of mass murder after Gaddafi's death - Africa 

Nigerian soldiers kill dozens of civilians - Africa

State-linked Libyan militias shell Bani Walid - Africa

Tunisia clash leaves opposition official dead - Africa 

UN urges military action plan for Mali - Africa 

Wounded Mauritania president flown to Paris - Africa 

Argentine crew to vacate ship seized in Ghana - Americas

Armstrong 'unaffected' by doping report - Americas 

Biden and Ryan set for crucial VP debate - Americas 

Brazil forces set for raid on Rio slums - Americas 

Candidates spar in US vice president debate - Americas 

Cuba's Castro appears in public - Americas 

First planet with four suns discovered - Americas 

Forecasters predict 'serious' Hurricane Sandy - Americas 

Hurricane Sandy approaches eastern US - Americas 

Tsunami warning for Hawaii lifted - Americas 

US deficit tops $1 trillion for fourth year - Americas 

US East Coast prepares for Hurricane Sandy - Americas 

Dozens dead in Afghanistan Eid suicide blast - Central & South Asia 

Pakistan court probes bartering of girls - Central & South Asia 

Pakistan teen activist in critical condition - Central & South Asia 

Berlusconi vows to remain in political arena - Europe 

Boxer a big hit as Ukraine readies for vote - Europe 

EU leaders agree on banking supervisor - Europe 

Germany's Merkel reassures Greece - Europe

Merkel arrives in Greece amid tight security - Europe 

Russia demands Turkey explain intercepted jet - Europe 

Russian opposition aide arrested - Europe 

Baghdad area hit by more deadly Eid attacks - Middle East 

Eid truce awaits Syrian government response - Middle East 

Kuwait police fire tear gas at protesters - Middle East 

Syrian forces continue to shell Aleppo - Middle East 



